ML Problem

Question

What does a typical ML problem look like?

Analogy

Think about arithmetic classes in primary school. During class hours, a student studies solved examples in a textbook to learn how to solve simple three-digit addition problems. Let us say that her textbook has the following problems along with their answers:

\(103+205=308\)
\(123+409=532\)
\(185+483=668\)

During the instructional hours, the student has access to both the questions and the answers. In the exam, she will not have access to the answers. But more importantly, she will not even be asked the same questions! So, just memorizing the answers will not help. She would have to learn how addition works. She needs to have a mental model of addition. In other words, she would have to learn a function from the input (question) to the output (answer).

This is exactly what happens in a regression problem. The inputs are a set of data-points. The outputs corresponding to these inputs are real numbers called targets or labels. A regression model has to learn the mapping from input to output. Once this mapping or function is learnt, the model can then be used to predict the output on unseen inputs. The collection of data-points along with their targets is called a labeled dataset. A regression model makes use of this dataset to learn a function. A labeled dataset is nothing but the textbook problems in our analogy.

Data Representation

We are given a collection of \(n\) data-points and \(n\) labels. Each data-point is described by \(d\) features. For example, in the housing dataset, each house is a data-point and is described by features such as latitude, longitude, area and so on. These features are clubbed together in a feature-vector of size \(d\). Arranging the \(n\) data-points in a matrix, we get an \(n\times d\) data-matrix. Let us call this matrix \(\mathbf{X}\):

\[ \begin{equation*} \mathbf{X} =\begin{bmatrix} x_{11} & \cdots & x_{1d}\\ \vdots & x_{ij} & \vdots \\ x_{n1} & \cdots & x_{nd} \end{bmatrix} \end{equation*} \]

Each row of this matrix is the feature vector for one data-point. The element \(x_{ij}\) is the \(j^{th}\) feature of the \(i^{th}\) data-point. The labels can be put together in a vector of size \(n\). Let us call this \(\mathbf{y}\):

\[ \begin{equation*} \mathbf{y} =\begin{bmatrix} y_{1}\\ \vdots \\ y_{n} \end{bmatrix} \end{equation*} \]
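To make the shapes concrete, here is a minimal sketch of a data-matrix and a label vector in NumPy. The numbers are made up for illustration, and treating the label as a house price is an assumption; the text only names latitude, longitude and area as example features.

```python
import numpy as np

# Hypothetical housing data (values invented for illustration).
# Each row is one house (data-point); the columns are the features
# latitude, longitude and area, so d = 3.
X = np.array([
    [12.97, 77.59, 1200.0],
    [13.08, 80.27, 950.0],
    [19.07, 72.87, 640.0],
    [28.61, 77.21, 1500.0],
])

# Hypothetical labels, one real number per house (assumed here to be a price).
y = np.array([85.0, 60.0, 120.0, 140.0])

n, d = X.shape      # number of data-points, number of features
print(n, d)         # shape of the data-matrix
print(y.shape)      # the label vector has one entry per data-point

# NumPy indexes from zero, so the math notation x_{ij} is X[i - 1, j - 1].
print(X[1])         # feature vector of the 2nd data-point (a row of X)
print(X[1, 2])      # x_{23}: the 3rd feature (area) of the 2nd data-point
```

Storing one data-point per row matches the convention in the text: row \(i\) of \(\mathbf{X}\) pairs with entry \(i\) of \(\mathbf{y}\).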

Remark

We will use the following conventions to represent scalars, vectors and matrices:

Scalars are represented by small letters in normal font: \(\displaystyle x,y,z\).
Vectors are represented by small letters in bold-faced font: \(\displaystyle \mathbf{x} ,\ \mathbf{y} ,\ \mathbf{z}\).
Matrices are represented by capital letters in bold-faced font: \(\displaystyle \mathbf{X} ,\mathbf{Y} ,\mathbf{Z}\).

Model

As stated earlier, a regression model can be viewed as a function that transforms a data-point into a label. Formally:

\[ \begin{equation*} f:\mathbb{R}^{d}\rightarrow \mathbb{R} \end{equation*} \]

Each feature-vector is of size \(d\). So, the feature-vectors reside in the \(d\) dimensional space \(\mathbb{R}^{d}\). The labels are real numbers, so they reside in \(\mathbb{R}\). Mathematically, this is the action of a model on a data-point \(\mathbf{x}\):

\[ \begin{equation*} y=f(\mathbf{x} ) \end{equation*} \]

Pictorially:

flowchart LR
  A[Feature Vector] --> B[Model]
  B[Model] --> C[Label]
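In code, a model of this kind is just a function that takes a feature-vector and returns a real number. The sketch below uses the addition analogy with \(d = 2\): the two features are the numbers being added. The function `f` is written by hand here purely to illustrate the signature \(f:\mathbb{R}^{d}\rightarrow \mathbb{R}\); in machine learning it would be learnt rather than written.

```python
from typing import Sequence

def f(x: Sequence[float]) -> float:
    """A hand-written 'model': maps a feature-vector to a single real number."""
    return float(sum(x))

# Feature vectors taken from the textbook problems in the analogy.
feature_vectors = [
    [103.0, 205.0],
    [123.0, 409.0],
    [185.0, 483.0],
]

# Applying the model to each data-point gives one label per data-point.
labels = [f(x) for x in feature_vectors]
print(labels)
```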

What is so special about an ML model? All models take some input and produce a corresponding output. The key difference is that in a classical programming setting, we are given the input and the function and are asked to find the output. In machine learning, we are given both the input and the output, and we have to use them to learn a model \(f\). The function or the model is the unknown. This is what has to be learnt. This is one reason machine learning is often associated with the phrase “learning from data”.

Learning

ML is all about learning from data. But who or what is learning? More importantly, who or what enables the learning? There is a learning algorithm which drives the learning. We can think of the model as the outcome of the learning process. During the learning stage, the dataset is fed as input to a learning algorithm, which in turn outputs a model.

flowchart LR
  A[Labeled Dataset] --> B[Learning Algorithm]
  B[Learning Algorithm] --> C[Model]

There is one important detail that is missing in this diagram. There are several models that we could choose from. Going back to our analogy, there are different ways to understand three-digit addition:

representing numbers as counts of physical objects
representing numbers as abstract entities that can be manipulated

Teachers might choose the first model to help kids understand addition. We all would have come across this example at some point: “If I have two chocolates in my left hand and five in my right hand, how many do I have in total?” As kids grow up, teachers might move to the second model, which is more sophisticated. Something similar happens in ML. We are the teachers for the machines. Our responsibility is to choose a family of models and present it to the machine.

flowchart LR
  A[Labeled Dataset] --> B[Learning Algorithm]
  D[Model Family] --> B[Learning Algorithm]
  B[Learning Algorithm] --> C[Model]

Thus there are two inputs to the learning algorithm:

labeled dataset
family of models

The task of the algorithm is to explore the space of models and pick the one that best fits the labeled dataset. ML scientists have come up with a variety of model families. One of the simplest is the family of linear models. We shall take this up in subsequent chapters.
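The idea of a learning algorithm with two inputs can be sketched as follows. This is a toy illustration, not a real learning algorithm: the model family is a hypothetical handful of candidate functions, and "best fits" is assumed to mean the lowest mean squared error on the labeled dataset, a choice the text does not specify. The names `model_family`, `mean_squared_error` and `learn` are invented for this example.

```python
# Labeled dataset: the three textbook problems from the analogy.
X = [(103.0, 205.0), (123.0, 409.0), (185.0, 483.0)]
y = [308.0, 532.0, 668.0]

# A tiny, hypothetical model family. Each candidate maps a
# feature-vector (a pair of numbers) to a real number.
model_family = {
    "sum": lambda x: x[0] + x[1],
    "difference": lambda x: x[0] - x[1],
    "product": lambda x: x[0] * x[1],
    "maximum": lambda x: max(x[0], x[1]),
}

def mean_squared_error(model, X, y):
    """Average squared gap between the model's predictions and the labels."""
    errors = [(model(x) - target) ** 2 for x, target in zip(X, y)]
    return sum(errors) / len(errors)

def learn(X, y, model_family):
    """Toy learning algorithm: try every candidate and keep the best fit."""
    best_name = min(
        model_family,
        key=lambda name: mean_squared_error(model_family[name], X, y),
    )
    return best_name, model_family[best_name]

best_name, model = learn(X, y, model_family)
print(best_name)

# The learnt model can now be used on a question that was not in the textbook.
print(model((250.0, 341.0)))
```

Real model families are usually infinite (for example, all linear functions of the features), so a practical learning algorithm searches them with optimization rather than by trying every candidate.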

Remark

For some regression models, once you have learnt the model, you can throw away the dataset (textbook). This is not true of all regression models though! Think about how you learnt three-digit addition. Do you still carry your primary school textbooks around? No! Your mind has a representation of what addition is. This representation is what we call a model.

Summary

Regression is a classic ML problem that uses labeled data to learn a mapping from a feature-vector to a real number. The data-points are arranged in a data-matrix called \(\mathbf{X}\). The labels are arranged in a label vector called \(\mathbf{y}\). A model is a function that transforms a feature-vector to a label.